reduce the buffer when using high dimensional data in distributed mode. #2485
Conversation
src/io/dataset_loader.cpp
Outdated
bool force_findbin_in_single_machine = false;
if (Network::num_machines() > 1) {
  int total_num_feature = Network::GlobalSyncUpByMin(num_col);
  size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
This will still lead to overflow; the operands need to be cast before the result is assigned to the wider type.
Suggested change:
- size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
+ size_t esimate_sync_size = (size_t)BinMapper::SizeForSpecificBin(config_.max_bin) * (size_t)total_num_feature;
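For illustration, here is a minimal standalone sketch of why the cast matters (the per-feature size and feature count below are made-up values, not LightGBM's actual numbers): when both operands are `int`, the multiplication is performed in `int` and can overflow before the result is widened to `size_t`.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Hypothetical stand-ins: ~2 KB of bin-mapper state per feature and
  // 20 million features (roughly the scale of kdda.t).
  int size_per_feature = 2048;       // stand-in for BinMapper::SizeForSpecificBin(max_bin)
  int total_num_feature = 20000000;  // stand-in for the synced feature count

  // Both operands are int, so the product is computed in int; 2048 * 20e6
  // far exceeds INT_MAX, which is signed overflow (undefined behavior,
  // typically observed as a wrapped/garbage value).
  size_t overflowed = size_per_feature * total_num_feature;

  // Casting the operands first makes the multiplication happen in size_t.
  size_t correct = static_cast<size_t>(size_per_feature) *
                   static_cast<size_t>(total_num_feature);

  std::printf("without cast: %zu\nwith cast:    %zu\n", overflowed, correct);
  return 0;
}
```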
src/io/dataset_loader.cpp
Outdated
bool force_findbin_in_single_machine = false;
if (Network::num_machines() > 1) {
  int total_num_feature = Network::GlobalSyncUpByMin(dataset->num_total_features_);
  size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
Same as above, to avoid overflow.
Suggested change:
- size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
+ size_t esimate_sync_size = (size_t)BinMapper::SizeForSpecificBin(config_.max_bin) * (size_t)total_num_feature;
These changes broke something in distributed training; datasets that were fine before now also throw an MPI error on an
I found that the total number of bins differs between machines.
And the init_score is different too; it seems the datasets used on these machines are different.
That is indeed weird. My data and the LightGBM distribution lie on a shared NFS, so all nodes have access to the same data. As I said, the exact same dataset (avazu-app.val) trains to completion on the current master, so I think this PR has some kind of side effect on the data loading.
@thvasilo sorry, I realize there is one more place I need to fix too; I will update this soon.
Tried the current PR, but I'm wondering if this is a regression or expected behavior, since the same dataset trains fine on master. As before, the nodes show different numbers of bins and init values, which I guess is a bug?
@thvasilo did it throw the warning "Communication cost is too large for distributed dataset loading, using single mode instead."? |
If the warning exists, I think the error is caused by overflow.
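To make the fallback concrete, here is a rough, self-contained sketch of the kind of check this PR adds (the `kMaxSyncSize` threshold and the `SizeForSpecificBin` stub below are placeholders, not LightGBM's real values): the loader estimates how large the bin-mapper sync would be and, if it is too expensive, switches to single-machine bin finding and prints that warning.

```cpp
#include <cstddef>
#include <cstdio>

// Placeholder for BinMapper::SizeForSpecificBin; the real function returns
// the size of one bin mapper for the given max_bin.
size_t SizeForSpecificBin(int max_bin) { return 32 + 8 * static_cast<size_t>(max_bin); }

int main() {
  const int num_machines = 3;
  const int max_bin = 255;
  const int total_num_feature = 20000000;            // e.g. kdda.t has ~20M features
  const size_t kMaxSyncSize = 1024ULL * 1024 * 1024;  // hypothetical 1 GB budget

  bool force_findbin_in_single_machine = false;
  if (num_machines > 1) {
    // Cast before multiplying so the estimate cannot wrap around in int.
    size_t estimate_sync_size =
        SizeForSpecificBin(max_bin) * static_cast<size_t>(total_num_feature);
    if (estimate_sync_size > kMaxSyncSize) {
      std::printf("Communication cost is too large for distributed dataset "
                  "loading, using single mode instead.\n");
      force_findbin_in_single_machine = true;
    }
  }
  std::printf("force_findbin_in_single_machine = %s\n",
              force_findbin_in_single_machine ? "true" : "false");
  return 0;
}
```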
Unfortunately new errors popped up on the datasets I've tried. This is for avazu-app.t (1M features):
And this is for kdda.t (20M features):
Yeah, actually the master branch is okay, except the printed information is not right.
Thanks for the prompt fix! avazu-app.t trains fine now, still getting the same error for kdda.t though.
Thanks! I will find out what caused that check to fail.
@thvasilo I think the latest commit should fix the check error. Could you try it again? Thanks very much! |
Thanks @guolinke, it does indeed work now! One last question I have is about the init_score. Here's some example output for kdda:
and avazu-app.val:
The init_score is synced. You can see:
The last 3 lines are the synced scores on the 3 nodes.
Thanks @thvasilo very much! Now distributed mode can be run efficiently over sparse features too.
One thing I noticed, when setting
If I set the parameter to 2 (or other values > 1) this does not happen. Seems like a corner case :/ For me the current PR is good enough, just something to keep in mind. |
@thvasilo it is not trivial to locate what causes the failure. I added a possible fix, you can have a try 😄
Now it takes much longer, but I still get an error. I'd just note this as a known issue if it is to be included in the release; as you said, this would be hard to track down.
Thanks, I think I need to debug that. Can the failure be reproduced when running on a single node?
Running on a single node (no MPI parallel training) seems to be working fine. One question I had: when I set
@thvasilo yeah, it is expected behavior. |
@thvasilo could you try one more time? If it still fails, I don't think I can fix it for now.
@guolinke
Hello @guolinke, sorry I got caught up with other stuff. Can confirm that training on kdda.t (20M features) with
Example output:
to fix #2484